Search CORE

141 research outputs found

Low-power Programmable Processor for Fast Fourier Transform Based on Transport Triggered Architecture

Author: Takala Jarmo
Žádník Jakub
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 17/04/2019
Field of study

This paper describes a low-power processor tailored for fast Fourier transform computations where transport triggering template is exploited. The processor is software-programmable while retaining an energy-efficiency comparable to existing fixed-function implementations. The power savings are achieved by compressing the computation kernel into one instruction word. The word is stored in an instruction loop buffer, which is more power-efficient than regular instruction memory storage. The processor supports all power-of-two FFT sizes from 64 to 16384 and given 1 mJ of energy, it can compute 20916 transforms of size 1024.Comment: 5 pages, 4 figures, 1 table, ICASSP 2019 conferenc

arXiv.org e-Print Archive

Crossref

Trepo - Institutional Repository of Tampere University

pocl: A Performance-Portable OpenCL Implementation

Author: Berg Heikki
de La Lama Carlos Sánchez
Jääskeläinen Pekka
Raiskila Kalle
Schnetter Erik
Takala Jarmo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with different characteristics, and, thus, required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications when compiled using pocl were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via arxi

arXiv.org e-Print Archive

Trepo - Institutional Repository of Tampere University

Memory-Based FFT Architecture with Optimized Number of Multiplexers and Memory Usage

Author: Garrido Mario
Kaya Zeynep
Takala Jarmo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2023
Field of study

This brief presents a new P-parallel radix-2 memory-based fast Fourier transform (FFT) architecture. The aim of this work is to reduce the number of multiplexers and achieve an efficient memory usage. One advantage of the proposed architecture is that it only needs permutation circuits after the memories, which reduces the multiplexer usage to only one multiplexer per parallel branch. Another advantage is that the architecture calculates the same permutation based on the perfect shuffle at each iteration. Thus, the shuffling circuits do not need to be configured for different iterations. In fact, all the memories require the same read and write addresses, which simplifies the control even further and allows to merge the memories. Along with the hardware efficiency, conflict-free memory access is fulfilled by a circular counter. The FFT has been implemented on a field programmable gate array. Compared to previous approaches, the proposed architecture has the least number of multiplexers and achieves very low area usage.publishedVersionPeer reviewe

Trepo - Institutional Repository of Tampere University

Zero-Quantized Inter DCT Coefficient Prediction for Real-Time Video Coding

Author: Jarmo Takala
Jin Li
Moncef Gabbouj
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

Efficient Software Synthesis of Dynamic Dataflow Programs

Author: Casseau Emmanuel
Jääskeläinen Pekka
Raulet Mickaël
Sanchez Alexandre
Takala Jarmo
Yviquel Hervé
Publication venue: HAL CCSD
Publication date: 05/05/2014
Field of study

International audienceThis paper introduces advanced software synthesis techniques that enhance the implementation of dynamic dataflow programs. These techniques have been implemented into open-source tools and demonstrated on well-known video decoders including one based on the new High Efficiency Video Coding (HEVC) standard. The results show an improvement of more than 100% of the frame-rate over previously proposed implementations, and achieve real-time decoding of high definition video sequences

INRIA a CCSD electronic archive server

OpenCL-based design methodology for application-specific processors

Author: Carlos S De La Lama
Jarmo H Takala
Pablo Huerta
Pekka O Jääskeläinen
Publication venue
Publication date: 01/01/2010
Field of study

Abstract-OpenCL is a programming language standard which enables the programmer to express the application by structuring its computation as kernels. The OpenCL compiler is given the explicit freedom to parallelize the execution of kernel instances at all the levels of parallelism. In comparison to the traditional C programming language which is sequential in nature, OpenCL enables higher utilization of parallelism naturally available in hardware constructs while still having a feasible learning curve for engineers familiar with the C language. This paper describes methodology and compiler techniques involved in applying OpenCL as an input language for a design flow of application-specific processors. At the core of the methodology is a whole program optimizing compiler that links together the host and kernel codes of the input OpenCL program and parallelizes the result on a customized statically scheduled processor. The OpenCL vendor extension mechanism is used to provide clean access to custom operations. The methodology is studied with a design case to verify the scalability of the implementation at the instruction level and to exemplify the use of custom operations. The case shows that the use of OpenCL allows producing scalable application-specific processor designs and makes it possible to gradually reach the performance of hand-tailored RTL designs by exploiting the OpenCL extension mechanism to access custom hardware operations of varying complexity

CiteSeerX

Trepo - Institutional Repository of Tampere University

Sonification of Markov-chain Monte Carlo Simulations

Author: Hansen Mark H.
Hermann Thomas
Hiipakka Jarmo
Ritter Helge
Takala Tapio
Zacharov N.
Publication venue: Laboratory of Acoustics and Audio Signal Processing and the Telecommunications Software and Multimedia Laboratory
Publication date: 01/01/2001
Field of study

Hermann T, Hansen MH, Ritter H. Sonification of Markov-chain Monte Carlo Simulations. In: Hiipakka J, Zacharov N, Takala T, eds. Proceedings of 7th International Conference on Auditory Display. Helsinki University of Technology: Laboratory of Acoustics and Audio Signal Processing and the Telecommunications Software and Multimedia Laboratory; 2001: 208-216.Markov chain Monte Carlo (McMC) simulation is a popular computational tool for making inferences from complex, high-dimensional probability densities. Given a particular target density , the idea behind this technique is to simulate a Markov chain that has as its stationary distribution. To be successful, the chain needs to be run long enough so that the distribution of the current draw is close to the target density. Unfortunately, very few diagnostic tools exist to monitor characteristics of the chain. In this paper, we present a new approach to render sonifications of McMC simulations. The proposed method consists of several auditory streams which provide information about the behavior of the Markov chain. In particular, we focus on uncovering modes in the target density function. In addition to monitoring, we have found our sonification to be an effective means for understanding the structure of high-dimensional densities. We have also applied our method to the exploratory analysis of high-dimensional data sets. In this case, we take as our target a non-parametric density estimate obtained from the data. In this paper, we present a detailed description of our sonification design and illustrate its performance on test cases consisting of both synthetic and real-world data sets. Sound examples are also given

Publications at Bielefeld University

Distance transform algorithm for bit-serial SIMD architectures

Author: Takala Jarmo
Viitanen Jouko
Publication venue: 'Elsevier BV'
Publication date: 01/01/1999
Field of study

VTT Research System

New Identical Radix-2^k Fast Fourier Transform Algorithms

Author: Fahad Fahad
Takala Jarmo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

The radix-2k fast Fourier transform (FFT) algorithm is used to achieve at the same time both a radix-2 butterfly and a reduced number of twiddle factor multiplication. In this paper we present a new identical radix-2^k FFT algorithms, which has same number of butterfly and twiddle factor multiplication. The difference is only in twiddle factor stage location in signal flow graph (SFG). Further, analyze these algorithms and is shown that the round-off noise of identical radix-22, radix-23, and radix-24 FFT algorithms at output is reduced 27%, 8%, 3% respectively.acceptedVersionPeer reviewe

Crossref

Trepo - Institutional Repository of Tampere University